Quality scores and SNP detection in sequencing-by-synthesis systems.

Authors

  • William Brockman
  • Pablo Alvarez
  • Sarah Young
  • Manuel Garber
  • Georgia Giannoukos
  • William L Lee
  • Carsten Russ
  • Eric S Lander
  • Chad Nusbaum
  • David B Jaffe
Abstract

Promising new sequencing technologies, based on sequencing-by-synthesis (SBS), are starting to deliver large amounts of DNA sequence at very low cost. Polymorphism detection is a key application. We describe general methods for improved quality scores and accurate automated polymorphism detection, and apply them to data from the Roche (454) Genome Sequencer 20. We assess our methods using known-truth data sets, which is critical to the validity of the assessments. We developed informative, base-by-base error predictors for this sequencer and used a variant of the phred binning algorithm to combine them into a single empirically derived quality score. These quality scores are more useful than those produced by the system software: They both better predict actual error rates and identify many more high-quality bases. We developed a SNP detection method, with variants for low coverage, high coverage, and PCR amplicon applications, and evaluated it on known-truth data sets. We demonstrate good specificity in single reads, and excellent specificity (no false positives in 215 kb of genome) in high-coverage data.
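
The abstract mentions combining base-by-base error predictors into a single empirically derived quality score using a variant of the phred binning algorithm. The sketch below is not the authors' method; it only illustrates the general idea under stated assumptions: a single hypothetical numeric predictor per base, fixed-width bins, a known-truth label for each base, and add-one smoothing chosen here for illustration. The names `phred`, `empirical_quality_table`, and `bin_width` are invented for this example.

```python
import math
from collections import defaultdict


def phred(p_error, cap=50):
    """Convert an error probability to a Phred-scaled score, Q = -10 * log10(p).

    The cap is an assumption of this sketch; it avoids an infinite score
    when the estimated error probability is zero.
    """
    if p_error <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_error)))


def empirical_quality_table(observations, bin_width=0.1):
    """Build a bin -> quality lookup from known-truth data.

    observations: iterable of (predictor_value, is_error) pairs, where
    predictor_value is a hypothetical non-negative error predictor for a
    base and is_error says whether that base disagreed with the known truth.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [errors, total]
    for value, is_error in observations:
        bin_index = int(value / bin_width)
        counts[bin_index][0] += int(is_error)
        counts[bin_index][1] += 1

    table = {}
    for bin_index, (errors, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) keeps a bin with
        # no observed errors from receiving an unbounded quality score.
        observed_rate = (errors + 1) / (total + 1)
        table[bin_index] = phred(observed_rate)
    return table


if __name__ == "__main__":
    # Toy known-truth data: 99 correct bases with a low predictor value,
    # 9 erroneous bases with a high predictor value.
    data = [(0.05, False)] * 99 + [(0.95, True)] * 9
    print(empirical_quality_table(data))
```

Each base is then scored by looking up the bin its predictor value falls into, so the reported quality reflects the error rate actually observed for similar bases in the known-truth set rather than a model-based guess.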

Similar resources

Engineering a high-performance SNP detection pipeline

We present Sprite, a bioinformatic data analysis pipeline for detecting single nucleotide polymorphisms (SNPs) in the human genome. A SNP detection pipeline for next-generation sequencing data uses several software tools, including tools for read preprocessing, read alignment, and SNP calling. We target end-to-end scalability and I/O efficiency in Sprite by merging tools in this pipeline and el...

List of Publications Faculty Opponent

This thesis is based on the following publications, which are referred to in the text by their Roman numerals. "Silhouette" scores for assessment of SNP genotype clusters. BMC Genomics. ...resolution, high-throughput SNP mapping in Drosophila melanogaster. Positional cloning by fast-track SNP-mapping in Drosophila melanogaster. Dried reagents for multiplex genotyping by tag-array minisequencing...

SNP detection for massively parallel whole-genome resequencing.

Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection m...

Effect of lossy compression of quality scores on variant calling

Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the gr...

An Evaluation of Power to Detect Low-Frequency Variant Associations Using Allele-Matching Tests that Account for Uncertainty

There is growing interest in the role of rare variants in multifactorial disease etiology, and increasing evidence that rare variants are associated with complex traits. Single SNP tests are underpowered in rare variant association analyses, so locus-based tests must be used. Quality scores at both the SNP and genotype level are available for sequencing data and they are rarely accounted for. A...

Journal title:
  • Genome Research

Volume 18, Issue 5

Pages: -

Publication date: 2008